Abstract

The correspondence problem in computer vision is essentially a matching task between two or more sets
of features. Computing feature correspondence is of great importance in computer vision, especially in
the subfields of object recognition, stereo, and motion. In this paper, we introduce a vectorized image
representation, a feature-based representation in which correspondence has been established with
respect to a reference image. The representation consists of two image measurements made at the
feature points: shape and texture. Feature geometry, or shape, is represented using the (x, y) locations of
features relative to some standard reference shape. Image grey levels, or texture, are represented by
mapping image grey levels onto the standard reference shape. Computing this representation is essentially
a correspondence task, and in this paper we explore an automatic technique for “vectorizing” face images.
Our face vectorizer alternates between computation steps for shape and texture, and a
key idea is to structure the two computations so that each one uses the output of the other. Namely, the
texture computation uses shape for geometrical normalization, and the shape computation uses the
texture analysis to synthesize a “reference” image for finding correspondences. A hierarchical coarse-to-fine
implementation is discussed, and applications are presented to the problems of facial feature detection
and registration of two arbitrary faces.
1  Introduction


The computation of correspondence is of great importance
in computer vision, especially in the subfields of
object recognition, stereo, and motion. The correspondence
problem is essentially a matching task between two
or more sets of features. In the case of object recognition,
one set of features comes from a prior object model
and the other from an image of the object. In stereo and
motion, the correspondence problem involves matching
features across different images of the object, where the
images may be taken from different viewpoints or over
time as the object moves. Common feature points are
often taken to be salient points along object contours
such as corners or vertices.

A common representation for objects in recognition,
stereo, and motion systems is feature-based; object
attributes are recorded at a set of feature points. The
set of feature points can be situated either in 3D as an
object-centered model or in 2D as a view-centered
description. To capture object geometry, one of the object
attributes recorded at each feature is its position in 2D
or 3D. Additionally, if the object has a detailed texture,
one may be interested in recording the local surface
albedo at each feature point or, more simply, the image
brightness. Throughout this paper we refer to these two
attributes respectively as shape and texture.

Given two or more sets of features, correspondence
algorithms match features across the feature sets. We
define a vectorized representation to be a feature-based
representation where correspondence has been
established relative to a fixed reference object or reference
image. Computing the vectorized representation can be
thought of as arranging the feature sets into ordered
vectors so that the ith element of each vector refers to the
same feature point for all objects. Given the correspondences
in the vectorized representation, subsequent
processing can register images to models for
recognition and estimate object depth or motion.

In this paper, we introduce an algorithm for computing
the vectorized representation for a class of objects
like the human face. Faces present an interesting class
of objects because of the variation seen across individuals
in both shape and texture. The intricate structure of
faces leads us to use a dense set of features to describe them.
Once a dense set of feature correspondences has been
computed between an arbitrary face and a “reference”
face, applications such as face recognition and pose and
expression estimation are possible. However, the focus of
this paper is on an algorithm for computing a vectorized
representation for faces.

The two primary components of the vectorized
representation are shape and texture. Previous approaches
to analyzing faces have stressed either one component or
the other, such as feature localization or decomposing
texture as a linear combination of eigenfaces (see Turk
and Pentland [37]). The key aspect of our vectorization
algorithm, or “vectorizer”, is that the two processes for
the analysis of shape and texture are coupled: each
process uses the output of the other. The texture
analysis uses shape for geometrical normalization,
and shape analysis uses texture to synthesize a reference
image for feature correspondence. Empirically, we
have found that this links the two processes in a positive
feedback loop. Iterating between the shape and texture
steps causes the vectorized representation to converge
after several iterations.

Our vectorizer is similar to the active shape model
of Cootes, et al. [17] [16] [23] in that both iteratively fit
a shape/texture model to the input. But there are
interesting differences in the modeling of both shape and
texture. In our vectorizer there is no model for shape; it
is measured in a data-driven manner using optical flow.
In active shape models, shape is modeled using a
parametric, example-based method. First, an ensemble of
shapes is processed using principal component analysis,
which produces a set of “eigenshapes”. New shapes
are then written as linear combinations of these
eigenshapes. Texture modeling in their approach, however,
is weaker than in ours. Texture is only modeled locally
along 1D contours at each of the feature points defining
shape. Our approach models texture over larger regions
(such as eyes, nose, and mouth templates), which should
provide more constraint for textural analysis. In the
future we intend to add a model for shape similar to active
shape models, as discussed ahead in section 6.2.

In this paper, we start in section 2 by providing a
more concrete definition of our vectorized shape and
texture representation. This is followed by a more detailed
description of the coupling of shape and texture. Next,
in section 3, we present the basic vectorization method
in more detail. Section 4 discusses a hierarchical
coarse-to-fine implementation of the technique. In section 5,
we demonstrate two applications of the vectorizer: facial
feature detection and the registration of two arbitrary
faces. The latter application is used to map prototypical
face transformations onto a face so that new “virtual”
views can be synthesized (see Beymer and Poggio [11]).
The paper closes with suggestions for future work,
including an idea to generalize the vectorizer to multiple
poses.

2  Preliminaries 

2.1  Vectorized  representation 

As mentioned in the introduction, the vectorized
representation is a feature-based representation where
correspondence has been established relative to a fixed
reference object or reference image. Computationally, this
requires locating a set of features on an object and bringing
them into correspondence with some prior reference
feature set. While it is possible to define a 3D,
object-centered vectorization, the vectorized representation in
this paper will be based on 2D frontal views of
the face. Thus, the representations for shape and
texture of faces will be defined in 2D and measured relative
to a 2D reference image.

Since the representation is relative to a 2D reference,
we first define a standard feature geometry for the
reference image. The features on new faces will then be
measured relative to the standard geometry. In this paper,
the standard geometry for frontal views of faces is
defined by averaging a set of line segment features over
an ensemble of “prototype” faces. Fig. 1 shows the line
segment features for a particular individual, and Fig. 2
shows the average over a set of 14 prototype people.
Features are assigned a text label (e.g. “ci”) so that
corresponding line segments can be paired across images. As
we will explain later in section 3.1, the line segment
features are specified manually in an initial off-line step that
defines the standard feature geometry.

Figure 1: To define the shape of the prototypes off-line,
manual line segment features are used. After Beier and
Neely [5].

Figure 2: Manually defined shapes are averaged to
compute the standard face shape.

The two components of the vectorized representation,
shape and texture, can now be defined relative to this
standard shape.

2.1.1  Shape 

Given the locations of n feature points $f_1, f_2, \ldots, f_n$
in an image $i_a$, an “absolute” measure of 2D shape is
represented by a vector $y_a$ of length 2n consisting of the
concatenation of the x and y coordinate values

$$y_a = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^{\top}.$$

This absolute representation for 2D shape has been
widely used, including network-based object recognition
(Poggio and Edelman [28]), the linear combinations
approach to recognition (Ullman and Basri [38],
Poggio [27]), active shape models (Cootes and Taylor [15],
Cootes, et al. [17]) and face recognition (Craw and
Cameron [18][19]).

Figure 3: Our vectorized representation for image $i_a$
with respect to the reference image $i_{std}$ at standard
shape. First, pixelwise correspondence is computed
between $i_{std}$ and $i_a$, as indicated by the grey arrow. Shape
$y^{std}_{a-std}$ is a vector field that specifies a corresponding
pixel in $i_a$ for each pixel in $i_{std}$. Texture consists of
the grey levels of $i_a$ mapped onto the standard shape.

A relative shape measured with respect to a standard
reference shape $y_{std}$ is simply the difference

$$y_a - y_{std},$$

which we denote using the shorthand notation $y_{a-std}$.
The relative shape $y_{a-std}$ is the difference in shape
between the individual in $i_a$ and the mean face shape.
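The absolute and relative shape vectors defined above can be sketched in a few lines of Python. This is only an illustration: the function names `shape_vector` and `relative_shape` and the three feature points are hypothetical, and a sparse point set is used here rather than the dense pixelwise shape introduced next.

```python
def shape_vector(points):
    """Concatenate (x, y) feature locations into a flat vector of length 2n."""
    vec = []
    for x, y in points:
        vec.extend([float(x), float(y)])
    return vec


def relative_shape(y_a, y_std):
    """Relative shape y_{a-std} = y_a - y_std, element by element."""
    if len(y_a) != len(y_std):
        raise ValueError("shape vectors must list the same features in the same order")
    return [a - s for a, s in zip(y_a, y_std)]


# Hypothetical feature points: two eye centers and a mouth center.
face_points = [(30, 40), (70, 42), (50, 85)]
std_points = [(32, 40), (68, 40), (50, 80)]

y_a = shape_vector(face_points)
y_std = shape_vector(std_points)
y_rel = relative_shape(y_a, y_std)
print(y_rel)
```

Because both vectors list the same features in the same order, the ith pair of entries in `y_rel` is the displacement of the ith feature from its standard position.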

To facilitate shape and texture operators in the run-time
vectorization procedure, shape is spatially oversampled.
That is, we use a pixelwise representation for
shape, defining a feature point at each pixel in a subimage
containing the face. The shape vector $y_{a-std}$ can
then be visualized as a vector field of correspondences
between a face at standard shape and the given image $i_a$
being represented. If there are n pixels in the face subimage
being vectorized, then the shape vector consists of
2n values, a $(\Delta x, \Delta y)$ pair for each pixel. In this dense,
pixelwise representation for shape, we need to keep track
of the reference image, so the notation is extended to
include the reference as a superscript, $y^{std}_{a-std}$. Fig. 3 shows
the shape representation $y^{std}_{a-std}$ for the image $i_a$. As
indicated by the grey arrow, correspondences are measured
relative to the reference face $i_{std}$ at standard shape.
(Image $i_{std}$ in this case is the mean grey level image; modeling
grey level texture is discussed more in section 3.1.)
Overall, the advantage of using a dense representation is that
it allows a simple optical flow calculation to be used for
computing shape and a simple 2D warping operator for
geometrical normalization.

2.1.2  Texture 

Our texture vector is a geometrically normalized
version of the image $i_a$. That is, the geometrical differences
among face images are factored out by warping the
images to the standard reference shape. This strategy for
representing texture has been used, for example, in the
face recognition works of Craw and Cameron [18], and
Shackleton and Welsh [33]. If we let shape $y_{std}$ be the
reference shape, then the geometrically normalized
image $t_a$ is given by the 2D warp

$$t_a(x, y) = i_a\big(x + \Delta x^{std}_{a-std}(x, y),\; y + \Delta y^{std}_{a-std}(x, y)\big),$$

where $\Delta x^{std}_{a-std}$ and $\Delta y^{std}_{a-std}$ are the x and y components
of $y^{std}_{a-std}$, the pixelwise mapping between $y_a$ and the
standard shape $y_{std}$. Fig. 3 in the lower right shows an
example texture vector $t_a$ for the input image $i_a$ in the
upper right.
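A minimal sketch of this backward warp follows. The paper does not say how grey levels are sampled at non-integer locations; bilinear interpolation with border clamping is an assumption made here, and the names `bilinear` and `backward_warp` are hypothetical.

```python
import math


def bilinear(img, x, y):
    """Sample img (a list of rows) at real-valued (x, y), clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bot = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bot


def backward_warp(i_a, dx, dy):
    """t_a(x, y) = i_a(x + dx(x, y), y + dy(x, y)) over the standard-shape grid."""
    h, w = len(dx), len(dx[0])
    return [[bilinear(i_a, x + dx[y][x], y + dy[y][x]) for x in range(w)]
            for y in range(h)]


# Toy input: a horizontal ramp, with a flow that points one column to the right.
i_a = [[0, 10, 20, 30], [0, 10, 20, 30]]
dx = [[1.0] * 4 for _ in range(2)]
dy = [[0.0] * 4 for _ in range(2)]
t_a = backward_warp(i_a, dx, dy)
print(t_a)
```

The loop runs over pixels of the standard shape and looks up grey levels in the input, so every output pixel receives a value and no holes appear.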

If shape is sparsely defined, then texture mapping
or sparse data interpolation techniques can be
employed to create the necessary pixelwise level representation.
Example sparse data interpolation techniques
include using splines (Litwinowicz and Williams [24],
Wolberg [40]), radial basis functions (Reisfeld, Arad, and
Yeshurun [31]), and inverse weighted distance metrics
(Beier and Neely [5]). If a pixelwise representation is
being used for shape in the first place, such as one
derived from optical flow, then texture mapping or data
interpolation techniques can be avoided.

2.1.3  Separation  of  shape  and  texture 

How cleanly have we separated the notions of shape
and texture in the 2D representations just described?
Ideally, the ultimate shape description would be a 3D
one where the (x, y, z) coordinates are represented.
Texture would be a description of local surface albedo at
each feature point on the object. Such descriptions are
common for the modeling of 3D objects for computer
graphics, and it would be nice for vision algorithms to
invert the imaging or “rendering” process from 3D
models to 2D images.

What our 2D vectorized description has done,
however, is to factor out and explicitly represent the salient
aspects of 2D shape. The true spatial density of this
2D representation depends, of course, on the density of
features defining standard shape, shown in our case in
Fig. 2. Some aspects of 2D shape, such as lip or eyebrow
thickness, will end up being encoded in our model for
texture. However, one could extend the standard
feature set to include more features around the mouth and
eyebrows if desired. For texture, there are non-albedo
factors confounded in the texture component, such as
lighting conditions and the z-component of shape.
Overall, though, remember that only one view of the object
being vectorized is available, thus limiting our access to
3D information. We hope that the current definitions of
shape and texture are a reasonable approximation to the
desired decomposition.

Figure 4: Vectorizing face images: if we know who the
person is and have prior example views $i_a$ of their face,
then we can manually warp $i_a$ to standard shape,
producing a reference $t_a$. New images of the person can be
vectorized by computing optical flow between $t_a$ and the
new input. However, if we do not have prior knowledge
of the person being vectorized, we can still synthesize an
approximation to $t_a$ by taking a linear combination
of prototype textures.

2.2  Shape/texture  coupling 

One of the main results of this paper is that the
computations for the shape and texture components can be
algorithmically coupled. That is, shape can be used to
geometrically normalize the input image prior to texture
analysis. Likewise, the result of texture analysis can be
used to synthesize a reference image for finding
correspondences in the shape computation. The result is an
iterative algorithm for vectorizing images of faces. Let
us now explore the coupling of shape and texture in more
detail.

2.2.1  Shape  perspective 

Since the vectorized representation is determined by
an ordered set of feature points, computing the
representation is essentially a feature finding or correspondence
task. Consider this correspondence task under a special
set of circumstances: we know who the person is, and we
have prior example views of that person. In this case, a
simple correspondence finding algorithm such as optical
flow should suffice. As shown in the left two images of
Fig. 4, first a prior example $i_a$ of the person’s face is
manually warped in an off-line step to standard shape,
producing a reference image $t_a$. A new image of the same
person can now be vectorized simply by running an
optical flow algorithm between the image and reference $t_a$.

If we have no prior knowledge of the person being
vectorized, the correspondence problem becomes more
difficult. In order to handle the variability seen in facial
appearance across different people, one could imagine
using many different example reference images that have
been pre-warped to the standard reference shape. These
reference images could be chosen, for example, by
running a clustering algorithm on a large ensemble of
example face images. This solution, however, introduces the
problem of having to choose among the reference images
for the final vectorization, perhaps based on a confidence
measure in the correspondence algorithm.


Going one step further, in this paper we use a
statistical model for facial texture in order to assist the
correspondence process. Our texture model relies on the
assumption, commonly made in the eigenface approach to
face recognition and detection (Turk and Pentland [37],
Pentland, et al. [26]), that the space of grey level images
of faces is linearly spanned by a set of example views.
That is, the geometrically normalized texture vector $t_a$
from the input image $i_a$ can be approximated as a linear
combination of n prototype textures $t_{p_i}$, $1 \le i \le n$,

$$\hat{t}_a = \sum_{i=1}^{n} \beta_i \, t_{p_i}, \qquad (1)$$

where the $t_{p_i}$ are themselves geometrically normalized
by warping them to the standard reference shape. The
rightmost image of Fig. 4, for example, shows an
approximation $\hat{t}_a$ that is generated by taking a linear
combination of textures as in equation (1). If the
vectorization procedure can estimate a proper set of $\beta_i$
coefficients, then computing correspondences should be
simple. Since the computed “reference” image $\hat{t}_a$
approximates the texture $t_a$ of the input and is geometrically
normalized, we are back to the situation where a simple
correspondence algorithm like optical flow should work.
In addition, the linear $\beta_i$ coefficients act as a low
dimensional code for representing the texture vector $t_a$.
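As an illustration of equation (1), the following sketch synthesizes a reference texture from a given set of coefficients. The function name `synthesize_texture`, the three-pixel "textures", and the coefficient values are hypothetical; how the coefficients are actually estimated is the subject of the texture step described later.

```python
def synthesize_texture(betas, prototypes):
    """Return sum_i betas[i] * prototypes[i] for flat lists of grey levels."""
    if len(betas) != len(prototypes):
        raise ValueError("need exactly one coefficient per prototype texture")
    m = len(prototypes[0])
    t_hat = [0.0] * m
    for beta, proto in zip(betas, prototypes):
        for k in range(m):
            t_hat[k] += beta * proto[k]
    return t_hat


# Two hypothetical prototype textures, already warped to standard shape.
prototypes = [[0.0, 100.0, 200.0], [100.0, 100.0, 100.0]]
betas = [0.25, 0.75]
t_hat = synthesize_texture(betas, prototypes)
print(t_hat)
```

Because the prototypes are all at standard shape, pixel k means the same facial location in every texture, which is what makes a pixelwise weighted sum meaningful.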

This raises the question of computing the $\beta_i$
coefficients for the texture model. Let us now consider the
vectorization procedure from the perspective of modeling
texture.

2.2.2  Texture  perspective 

To develop the vectorization technique from the
texture perspective, consider the simple eigenimage, or
“eigenface”, model for the space of grey level face images.
The eigenface approach for modeling face images has
been used recently for a variety of facial analysis tasks,
including face recognition (Turk and Pentland [37],
Akamatsu, et al. [2], Pentland, et al. [26]), reconstruction
(Kirby and Sirovich [22]), face detection (Sung and
Poggio [35], Moghaddam and Pentland [25]), and facial
feature detection (Pentland, et al. [26]). The main
assumption behind this modeling approach is that the space of
grey level images of faces is linearly spanned by a set of
example face images. To optimally represent this “face
space”, principal component analysis is applied to the
example set, extracting an orthogonal set of eigenimages
that define the dimensions of face space. Arbitrary faces
are then represented by the set of coefficients computed
by projecting the face onto the set of eigenimages.
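The projection described above can be sketched as follows, assuming the eigenimages are orthonormal and that the mean texture is subtracted before projecting (standard practice in the eigenface literature, but an assumption at this point in the text). The names `project` and `reconstruct` and the toy three-pixel data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(texture, mean, eigenimages):
    """Coefficients of (texture - mean) along each orthonormal eigenimage."""
    centered = [t - m for t, m in zip(texture, mean)]
    return [dot(e, centered) for e in eigenimages]


def reconstruct(coeffs, mean, eigenimages):
    """Approximation of the texture inside the subspace: mean + sum_i c_i e_i."""
    out = list(mean)
    for c, e in zip(coeffs, eigenimages):
        out = [o + c * v for o, v in zip(out, e)]
    return out


# Toy 3-pixel "face space" spanned by two orthonormal eigenimages.
mean = [1.0, 1.0, 1.0]
eigenimages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
texture = [3.0, 5.0, 2.0]

coeffs = project(texture, mean, eigenimages)
approx = reconstruct(coeffs, mean, eigenimages)
print(coeffs, approx)
```

In this toy case the third pixel's deviation from the mean lies outside the two-dimensional subspace, so the reconstruction recovers the first two pixels exactly and leaves the third at its mean value.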

One requirement on face images, both for the example
set fed to principal components and for new images
projected onto face space, is that they be geometrically
normalized so that facial features line up across all
images. Most normalization methods use a global
transform, usually a similarity or affine transform, to align
two or three major facial features. For example, in
Pentland, et al. [26], the imaging apparatus effectively
registers eyes, and Akamatsu, et al. [2] register the eyes and
mouth.


However, because of the inherent variability of facial
geometries across different people, aligning just a couple
of features, such as the eyes, leaves other features
misaligned. To the extent that some features are misaligned,
even this normalized representation will confound
differences in grey level information with differences in local
facial geometry. This may limit the representation’s
generalization ability to new faces outside the original
example set used for principal components. For example, a
new face may match the texture of one particular linear
combination of eigenimages but the shape may require
another linear combination.

To decouple texture and shape, Craw and Cameron [18] and
Shackleton and Welsh [33] represent shape separately and use
it to geometrically normalize face texture by deforming
it to a standard shape. Shape is defined by the (x, y)
locations of a set of feature points, as in our definition
for shape. In Craw and Cameron [18], 76 points
outlining the eyes, nose, mouth, eyebrows, and head are used.
To geometrically normalize texture using shape, image
texture is deformed to a standard face shape, making
it “shape free”. This is done by first triangulating the
image using the features and then texture mapping.

However, they did not demonstrate an effective
automatic method for computing the vectorized
shape/texture representation. This is mainly due to
difficulties in finding correspondences for shape, where
probably on the order of tens of features need to be located.
Craw and Cameron [18] manually locate their features.
Shackleton and Welsh [33], who focus on eye images, use
the deformable template approach of Yuille, Cohen, and
Hallinan [41] to locate eye features. However, for 19/60
of their example eye images, feature localization is
rated as either “poor” or “no fit”.

Note that in both of these approaches, computation of
the shape and texture components has been separated,
with shape being computed first. This differs from our
approach, where shape and texture computations are
interleaved in an iterative fashion. In their approach the
link from shape to texture is present: shape is used to
geometrically normalize the input. But the reverse link,
using a texture model to assist in finding correspondences,
is not exploited.

2.2.3  Combining  shape  and  texture 

Our face vectorizer consists of two primary steps, a
shape step that computes vectorized shape $y^{std}_{a-std}$ and
a texture step that uses the texture model to
approximate the texture vector $t_a$. Key to our vectorization
procedure is linking the two steps in a mutually beneficial
manner and iterating back and forth between the
two until the representation converges. First, consider
how the result of the texture step can be used to
assist the shape step. Assuming for the moment that the
texture step can provide an estimate $\hat{t}_a$ using equation
(1), then the shape step estimates $y^{std}_{a-std}$ by computing
optical flow between the input and $\hat{t}_a$.

Next, to complete the loop between shape and
texture, consider how the shape $y^{std}_{a-std}$ can be used to
compute the texture approximation $\hat{t}_a$. The shape $y^{std}_{a-std}$ is
used to geometrically normalize the input image using
the backward warp

$$t_a(\mathbf{x}) = i_a\big(\mathbf{x} + y^{std}_{a-std}(\mathbf{x})\big),$$

where $\mathbf{x} = (x, y)$ is a 2D pixel location in standard shape.
This normalization step aligns the facial features in the
input image with those in the textures $t_{p_i}$. Thus, when
$t_a$ is approximated in the texture step by projecting it
onto the linear space spanned by the $t_{p_i}$, facial features
are properly registered.

Given initial conditions for shape and texture, our
proposed system switches back and forth between
texture and shape computations until a stable solution is
found. Because of the manner in which the shape and
texture computations feed back on each other, improving
one component improves the other: better
correspondences mean better feature alignment for textural
analysis, and computing a better textural approximation
improves the reference image used for finding
correspondences. Empirically, we have found that the
representation converges after several iterations.
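The alternation just described can be summarized as a control-flow skeleton. This is a sketch under stated assumptions, not the implementation used in the paper: the callables `shape_step`, `warp_step`, and `texture_step` are hypothetical placeholders for optical flow, the backward warp, and the texture model, and the stopping rule (largest change in the texture estimate below `tol`) is an assumed stand-in for "until a stable solution is found".

```python
def vectorize(image, initial_texture, shape_step, warp_step, texture_step,
              max_iters=10, tol=1e-3):
    """Alternate shape and texture steps until the texture estimate stops changing."""
    t_hat = initial_texture
    flow = None
    iteration = 0
    for iteration in range(1, max_iters + 1):
        flow = shape_step(image, t_hat)   # correspondences from reference to input
        t_a = warp_step(image, flow)      # geometrically normalized texture
        new_t_hat = texture_step(t_a)     # approximation within the texture model
        change = max(abs(a - b) for a, b in zip(new_t_hat, t_hat))
        t_hat = new_t_hat
        if change < tol:
            break
    return flow, t_hat, iteration


# Trivial stand-ins that only exercise the control flow (not real vision code).
image = [4.0, 8.0]
flow, t_hat, n_iters = vectorize(
    image,
    initial_texture=[0.0, 0.0],
    shape_step=lambda img, ref: [0.0] * len(img),            # pretend: zero flow
    warp_step=lambda img, fl: list(img),                      # pretend: identity warp
    texture_step=lambda t: [sum(t) / len(t)] * len(t),        # pretend: 1-D model
)
print(t_hat, n_iters)
```

With these stand-ins the texture estimate settles on the second pass; with real shape and texture steps the number of iterations would depend on the input.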

Now that we have seen a general outline of our
vectorizer, let us explore the details.

3  Basic  Vectorization  Method 

The basic method for our vectorizer breaks down into
two main parts: the off-line preparation of the example
textures $t_{p_i}$, and the on-line vectorization procedure
applied to a new input image.

3.1  Off-line  preparation  of  examples 

The basic assumption made in modeling vectorized
texture is that the space of face textures is linearly spanned
by a set of geometrically normalized example face
textures. Thus, in constructing a vectorizer we must first
collect a group of representative faces that will define
face space, the space of the textural component in our
representation. Before using the example faces in the
vectorizer, they are geometrically normalized to align
facial features, and the grey levels are processed using
principal components or the pseudoinverse to optimize
run-time textural processing.

3.1.1  Geometric  normalization 

To geometrically normalize an example face, we
apply a local deformation to the image to warp the face
shape into a standard geometry. This local deformation
requires both the shape of the example face and
some definition of the standard shape. Thus, our off-line
normalization procedure needs the face shape component
for our example faces, something we provide manually.
These manual correspondences are averaged to define the
standard shape. Finally, a 2D warping operation is
applied to do the normalization. We now go over these
steps in more detail.

First, to define the shape of the example faces, a set of
line segment features is positioned manually for each.
The features, shown in Fig. 1, follow Beier and Neely’s [5]
manual correspondence technique for morphing face
images. Pairing up image feature points into line segments
gives one a natural control over local scale and rotation
in the eventual deformation to standard shape, as we will
explain later when discussing the deformation technique.

Figure 5: Examples of off-line geometrical normalization
of example images. Texture for the normalized images is
sampled from the original images - that is why the chin
is generated for the second example.

Next, we average the line segments over the example
images to define the standard face shape (see Fig. 2).
We don’t have to use averaging: since we are creating
a definition, we could have just chosen a particular
example face. However, averaging shape should minimize
the total amount of distortion required in the next step
of geometrical normalization.

Finally, images are geometrically normalized using the
local deformation technique of Beier and Neely [5]. This
deformation technique is driven by the pairing of line
segments in the example image with line segments in
the standard shape. Consider a single pairing of line
segments, one segment from the example image and
one, $l_{std}$, from the standard shape. This line segment
pair essentially sets up a local transform from the region
surrounding the example segment to the region surrounding
$l_{std}$. The local transform resembles a similarity transform
except that there is no scaling perpendicular to the segment, just
scaling along it. The local transforms are computed for
each segment pair, and the overall warping is taken as
a weighted average. Some examples of images before and
after normalization are shown in Fig. 5.
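The local transform for a single line segment pair can be sketched with the coordinate mapping from Beier and Neely's field morphing: a point is expressed by its fractional position u along the segment and its signed perpendicular distance v, and those coordinates are reapplied to the paired segment. The function and argument names below (`map_point`, `p_std`, `q_std`, `p_ex`, `q_ex`) are hypothetical, and the weighted averaging over several segment pairs is omitted.

```python
import math


def perp(v):
    """Rotate a 2D vector by 90 degrees."""
    return (-v[1], v[0])


def map_point(x, p_std, q_std, p_ex, q_ex):
    """Map pixel x in standard shape to its location in the example image,
    using one segment pair: (p_std, q_std) in standard shape and
    (p_ex, q_ex) in the example image."""
    d_std = (q_std[0] - p_std[0], q_std[1] - p_std[1])
    d_ex = (q_ex[0] - p_ex[0], q_ex[1] - p_ex[1])
    len_std = math.hypot(d_std[0], d_std[1])
    len_ex = math.hypot(d_ex[0], d_ex[1])
    rel = (x[0] - p_std[0], x[1] - p_std[1])
    # u: fractional position along the segment (this coordinate scales with length).
    u = (rel[0] * d_std[0] + rel[1] * d_std[1]) / (len_std ** 2)
    # v: signed perpendicular distance in pixels (this coordinate is not scaled).
    n_std = perp(d_std)
    v = (rel[0] * n_std[0] + rel[1] * n_std[1]) / len_std
    n_ex = perp(d_ex)
    return (p_ex[0] + u * d_ex[0] + v * n_ex[0] / len_ex,
            p_ex[1] + u * d_ex[1] + v * n_ex[1] / len_ex)


# The example segment is twice as long as the standard one, same orientation.
mapped = map_point((1.0, 1.0), (0.0, 0.0), (2.0, 0.0), (0.0, 0.0), (4.0, 0.0))
print(mapped)
```

In this example the coordinate along the segment doubles while the perpendicular offset is unchanged, which is the "no scaling perpendicular to the segment" behavior described above.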

3.1.2  Texture  processing 

Now that the example faces have been normalized for
shape, they can be used for texture modeling. Given a
new input $i_a$, the texture analysis step tries to
approximate the input texture $t_a$ as a linear combination of
the example textures. Of course, given a linear subspace
such as our face space, one can choose among different
sets of basis vectors that will span the same subspace.
One popular method for choosing the basis set, the
eigenimage approach, applies principal components analysis
to the example set. Another potential basis set is simply
the original set of images themselves. We now discuss
the off-line texture processing required for the two basis
sets of principal components and the original images.

Principal components analysis is a classical technique
for reducing the dimensionality of a cluster of data
points, where the data are assumed to be distributed
in an ellipsoid pattern about a cluster center. If there is
correlation in the data among the coordinate axes, then
one can project the data points to a lower dimensional
subspace without losing much information. This corresponds
to an ellipsoid with interesting variation along a number
of directions that is less than the dimensionality of
the data points. Principal components analysis finds the
lower dimensional subspace inherent in the data points.
It works by finding a set of directions $e_i$ such that the
variance in the data points is highest when projected
onto those directions. These directions are computed
by finding the eigenvectors of the covariance matrix
of the data points.

In our ellipsoid of $n$ geometrically normalized textures
$t_{p_j}$, let $t'_{p_j}$ be the set of textures with the mean $t_{mean}$
subtracted off

$$t_{mean} = \frac{1}{n} \sum_{j=1}^{n} t_{p_j}$$

$$t'_{p_j} = t_{p_j} - t_{mean}, \quad 1 \le j \le n.$$

If we let $T$ be a matrix where the $j$th column is $t'_{p_j}$

$$T = [\, t'_{p_1} \;\; t'_{p_2} \;\; \cdots \;\; t'_{p_n} \,],$$

then the covariance matrix is defined as

$$\Sigma = T T^t.$$


Notice that $T$ is an $m \times n$ matrix, where $m$ is the number of
pixels in vectorized texture vectors. Due to our pixelwise
representation for shape, $m \gg n$ and thus $\Sigma$, which is
an $m \times m$ matrix, is quite large and may be intractable
for eigenanalysis. Fortunately, one can solve the smaller
eigenvector problem for the $n \times n$ matrix $T^t T$. This is
possible because an eigenvector $e_i$ of $T^t T$

$$T^t T \, e_i = \lambda_i e_i$$

corresponds to an eigenvector $T e_i$ of $\Sigma$. This can be
seen by multiplying both sides of the above equation by
the matrix $T$

$$(T T^t) \, T e_i = \lambda_i \, T e_i.$$
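The reduction to the smaller eigenproblem can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function name `eigenimages_small` and the relative tolerance used to drop null directions are choices made for the example.

```python
import numpy as np

def eigenimages_small(T, rel_tol=1e-10):
    """Eigenvectors of T T^t obtained from the n x n matrix T^t T.

    T is m x n with one mean-subtracted texture per column. Returns the
    eigenvalues in decreasing order and the matching unit-length
    eigenimages as columns of an m x k matrix.
    """
    lam, e = np.linalg.eigh(T.T @ T)      # small n x n problem, ascending order
    order = np.argsort(lam)[::-1]
    lam, e = lam[order], e[:, order]
    keep = lam > rel_tol * lam[0]         # drop numerically null directions
    lam, e = lam[keep], e[:, keep]
    big = T @ e                           # each column T e_i is an eigenvector of T T^t
    big /= np.linalg.norm(big, axis=0)    # normalize to unit length
    return lam, big
```

Because the mean has been subtracted, the $n$ columns of $T$ sum to zero, so at most $n - 1$ eigenimages have nonzero eigenvalues; the tolerance test removes the remaining direction.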

Since the eigenvectors (or eigenimages) with the larger
eigenvalues $\lambda_i$ explain the most variance in the example
set, only a fraction of the eigenimages need to be retained
for the basis set. In our implementation, we chose to use
roughly half the eigenimages. Fig. 6 shows the mean face
and the first 6 eigenimages from a principal components
analysis applied to a group of 55 people.

Since the eigenimages are orthogonal (and can easily
be normalized to be made orthonormal), analysis and
reconstruction of new image textures during vectorization
can be easily performed. Say that we retain $N$ eigenimages,
and let $t_a$ be a geometrically normalized texture
to analyze. Then the run-time vectorization procedure
projects $t_a$ onto the $e_i$

$$\beta_i = e_i \cdot (t_a - t_{mean}) \qquad (2)$$

and can reconstruct $t_a$, yielding $\hat{t}_a$

$$\hat{t}_a = t_{mean} + \sum_{i=1}^{N} \beta_i e_i. \qquad (3)$$
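Equations (2) and (3) amount to two matrix products when the eigenimages are stored as orthonormal columns. A minimal sketch, assuming a matrix `E` of shape $m \times N$ whose columns are the retained eigenimages (the names `analyze` and `reconstruct` are hypothetical):

```python
import numpy as np

def analyze(t_a, E, t_mean):
    """Equation (2): beta_i = e_i . (t_a - t_mean) for every retained e_i."""
    return E.T @ (t_a - t_mean)

def reconstruct(beta, E, t_mean):
    """Equation (3): t_hat = t_mean + sum_i beta_i e_i."""
    return t_mean + E @ beta
```

If `t_a` lies in the span of the eigenimages (after mean subtraction), `reconstruct(analyze(t_a, E, t_mean), E, t_mean)` returns it unchanged; otherwise it returns the closest texture in that subspace.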


Figure  6:  Mean  image  and  eigenimages  from  applying 
principal  components  analysis  to  the  geometrically  nor¬ 
malized  examples. 


Another potential basis set is the original example
textures themselves. That is, we approximate $t_a$ by a
linear combination of the $n$ original image textures $t_{p_i}$

$$\hat{t}_a = \sum_{i=1}^{n} \beta_i t_{p_i}. \qquad (4)$$

While we do not need to solve this equation until on-line
vectorization, previewing the solution will elucidate
what needs to be done for off-line processing. Write
equation (4) in matrix form

$$\hat{t}_a = T \beta, \qquad (5)$$

where $\hat{t}_a$ is written as a column vector, $T$ is a matrix
where the $i$th column is $t_{p_i}$, and $\beta$ is a column vector of
the $\beta_i$'s. Solving this with linear least squares yields

$$\beta = T^\dagger t_a \qquad (6)$$

$$\phantom{\beta} = (T^t T)^{-1} T^t \, t_a, \qquad (7)$$

where $T^\dagger = (T^t T)^{-1} T^t$ is the pseudoinverse of $T$. The
pseudoinverse can be computed off-line since it depends
only on the example textures $t_{p_i}$. Thus, run-time vectorization
performs texture analysis with the rows of
$T^\dagger$ (equation (6)) and reconstruction with the columns
of $T$ (equation (5)). Fig. 7 shows some example images
processed by the pseudoinverse where $n$ was 40.
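The split between off-line and run-time work for the example-texture basis can be sketched as below. This is an illustration only; the function names are hypothetical, and `np.linalg.pinv` is used in place of forming $(T^t T)^{-1} T^t$ explicitly, which gives the same matrix when the columns of $T$ are linearly independent.

```python
import numpy as np

def offline_pseudoinverse(T):
    """Off-line step: pseudoinverse of the m x n example-texture matrix T."""
    return np.linalg.pinv(T)

def analyze_texture(T_dag, t_a):
    """Equation (6): least-squares coefficients beta = T_dag t_a."""
    return T_dag @ t_a

def reconstruct_texture(T, beta):
    """Equation (5): t_hat = T beta."""
    return T @ beta
```

The residual `t_a - reconstruct_texture(T, beta)` is orthogonal to every example texture, which is the defining property of the least-squares solution.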

Note that for both basis sets, the linear coefficients are
computed using a simple projection operation. Coding-wise
at run-time, the only difference is whether one subtracts
off the mean image $t_{mean}$. In practice though,
the eigenimage approach will require fewer projections
since not all eigenimages are retained. Also, the orthogonality
of the eigenimages may produce a more stable
set of linear coefficients. Consider what happens for the
pseudoinverse approach when two example images are
similar in texture: two columns of $T$ are then nearly parallel,
$T^t T$ is close to singular, and small changes in $t_a$ can
produce large changes in the corresponding coefficients.
Yet another potential basis set, one
that has the advantage of orthogonality, would be the
result of applying Gram-Schmidt orthonormalization to
the example set.

Most  of  our  vectorization  experiments  have  been  with 
the  eigenimage  basis,  so  the  notation  in  the  next  section 
uses  this  basis  set. 


3.2  Run-time  vectorization

In this section we go over the details of the vectorization
procedure. The inputs to the vectorizer are an image $i_a$
to vectorize and a texture model consisting of $N$ eigenimages
$e_i$ and mean image $t_{mean}$. In addition, the vectorizer
takes as input a planar transform $P$ that selects
the face region from the image $i_a$ and normalizes it for
the effects of scale and image-plane rotation. The planar
transform $P$ can be a rough estimate from a coarse
scale analysis. Since the faces in our test images were
taken against a solid background, face detection is relatively
easy and can be handled simply by correlating
with a couple of face templates. The vectorization procedure
refines the estimate $P$, so the final outputs of the
procedure are the vectorized shape $y^{std}_{a-std}$, a set of $\beta_i$
coefficients for computing $\hat{t}_a$, and a refined estimate of
$P$.

Figure 7: Example textures processed by the pseudoinverse
$T^\dagger = (T^t T)^{-1} T^t$. When using the original set of
image textures as a basis, texture analysis is performed
by projection onto these images.

As  mentioned  previously,  the  interconnectedness  of 
the  shape  and  texture  steps  makes  the  iteration  con¬ 
verge.  Fig.  8  depicts  the  convergence  of  the  vectoriza¬ 
tion  procedure  from  the  perspective  of  texture.  There 
are  three  sets  of  face  images  in  the  figure,  sets  of  (1)  all 
face  images,  (2)  geometrically  normalized  face  textures, 
and  (3)  the  space  of  our  texture  model.  The  difference 
between  the  texture  model  space  and  the  set  of  geomet¬ 
rically  normalized  faces  depends  on  the  prototype  set  of 
n  example  faces.  The  larger  and  more  varied  this  set  be¬ 
comes,  the  smaller  the  difference  becomes  between  sets 
(2)  and  (3).  Here  we  assume  that  the  texture  model  is 
not  perfect,  so  the  true  ta  is  slightly  outside  the  texture 
model  space. 

The goal of the iteration is to make estimates of $t_a$
and $\hat{t}_a$ converge to the true $t_a$. The path for $t_a$, the
geometrically normalized version of $i_a$, is shown by the
curve from $i_a$ to the final $t_a$. The path for $\hat{t}_a$ is shown
by the curve from initial $\hat{t}_a$ to final $\hat{t}_a$. The texture and
shape steps are depicted by the arrows jumping between
the curves. The texture step, using the latest estimate of
shape to produce $t_a$, projects $t_a$ into the texture model
space. The shape step uses the latest $\hat{t}_a$ to find a new
set of correspondences, thus updating shape and hence
$t_a$. As one moves along the $t_a$ curve, one is getting
better estimates of shape. As one moves along the $\hat{t}_a$
curve, the $\beta_i$ coefficients in the texture model improve.
Since the true $t_a$ lies outside the texture model space,
the iteration stops at final $\hat{t}_a$. This error can be made
smaller by increasing the number of prototypes for the
texture model.

We  now  look  at  one  iteration  step  in  detail. 

3.2.1  One  iteration 

In examining one iteration of the texture and shape
steps, we assume that the previous iteration has provided
an estimate for $y^{std}_{a-std}$ and the $\beta_i$ coefficients. For
the first iteration, an initial condition of $y^{std}_{a-std} = 0$ is
used. No initial condition is needed for texture since the
iteration starts with the texture step.

In the texture step, first the input image $i_a$ is geometrically
normalized using the shape estimate $y^{std}_{a-std}$,
producing $t_a$

$$t_a(x) = i_a(x + y^{std}_{a-std}(x)), \qquad (8)$$

where $x = (x, y)$ is a pixel location in the standard shape.
This is implemented as a backwards warp using the flow
vectors pointing from the standard shape to the input.
Bilinear interpolation is used to sample $i_a$ at non-integral
$(x, y)$ locations. Next, $t_a$ is projected onto the eigenimages
$e_i$ using equation (2) to update the linear coefficients
$\beta_i$. These updated coefficients should enable the
shape computation to synthesize an approximation $\hat{t}_a$
that is closer to the true $t_a$.
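The backwards warp of equation (8) with bilinear interpolation can be sketched as follows. This is a minimal NumPy illustration, not the paper's C implementation. The function name, the `(dx, dy)` layout of the flow array, and the clamping of out-of-range samples to the image border are assumptions made for the example.

```python
import numpy as np

def backward_warp(image, flow):
    """Compute t(x) = image(x + flow(x)) with bilinear interpolation.

    image: H x W array (H, W >= 2). flow: H' x W' x 2 array holding
    (dx, dy) at each output pixel. Samples falling outside the image
    are clamped to the border (an assumption of this sketch).
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:flow.shape[0], 0:flow.shape[1]].astype(float)
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    # Top-left corner of the 2 x 2 neighbourhood used for interpolation.
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    fx, fy = sx - x0, sy - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x0 + 1]
    bottom = (1 - fx) * image[y0 + 1, x0] + fx * image[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom
```

A backwards warp visits every output pixel exactly once, so the normalized texture has no holes, which is why it is preferred here over pushing input pixels forward.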

In the shape step, first a reference image $\hat{t}_a$ is synthesized
from the texture coefficients using equation (3).
Since the reference image reconstructs the texture of the
input, it should be well suited for finding shape correspondences.
Next, optical flow is computed between $\hat{t}_a$,
which is geometrically normalized, and $i_a$, which updates
the pixelwise correspondences $y^{std}_{a-std}$. For optical flow,
we used the gradient-based hierarchical scheme of Bergen
and Adelson [7], Bergen and Hingorani [9], and Bergen,
et al. [8]. The new correspondences should provide better
geometrical normalization in the next texture step.

Overall, iterating these steps until the representation
stabilizes is equivalent to iteratively solving for the
$y^{std}_{a-std}$ and $\beta_i$ which best satisfy

$$t_a = \hat{t}_a,$$

or

$$i_a(x + y^{std}_{a-std}(x)) = t_{mean} + \sum_{i=1}^{N} \beta_i e_i.$$

3.2.2  Adding  a  global  transform 

We introduce a planar transform $P$ to select the image
region containing the face and to normalize the face for
the effects of scale and image-plane rotation. Let $i'_a$ be
the input image $i_a$ resampled under the planar transform
$P$

$$i'_a(x) = i_a(P(x)). \qquad (9)$$

It is this resampled image $i'_a$ that will be geometrically
normalized in the texture step and used for optical flow
in the shape step.

Besides selecting the face, the transform $P$ will also be
used for selecting subimages around individual features
such as the eyes, nose, and mouth. As will be explained
in the next section on our hierarchical implementation,
the vectorization procedure is applied in a coarse-to-fine
strategy on a pyramid structure. Full face templates are
vectorized at the coarser scales and individual feature
templates are vectorized at the finer scales.

Figure 8: Convergence of the vectorization procedure with regards to texture. The texture and shape steps try to
make $t_a$ and $\hat{t}_a$ converge to the true $t_a$.

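Transform $P$ will be a similarity transform

$$P(x) = s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} x + \begin{bmatrix} t_x \\ t_y \end{bmatrix},$$

where the scale $s$, image-plane rotation $\theta$, and 2D translation
$(t_x, t_y)$ are determined in one of two ways, depending
on the region being vectorized.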

1. Two point correspondences. Define anchor points
$q_{std,1}$ and $q_{std,2}$ in standard shape, which can be
done manually in off-line processing. Let $q_{a,1}$ and
$q_{a,2}$ be estimates of the anchor point locations in
the image $i_a$, estimates which need to be performed
on-line. The similarity transform parameters are
then determined such that

$$P(q_{std,1}) = q_{a,1}, \quad P(q_{std,2}) = q_{a,2}. \qquad (10)$$

This uses the full flexibility of the similarity transform
and is used when the image region being vectorized
contains two reliable feature points such as
the eyes.

2. Fixed $s$, $\theta$, and one point correspondence. In this
case there is only one anchor point $q_{std,1}$, and one
solves for $t_x$ and $t_y$ such that

$$P(q_{std,1}) = q_{a,1}. \qquad (11)$$

This is useful for vectorizing templates with less
reliable features such as the nose and mouth. For
these templates the eyes are vectorized first and
used to fix the scale and rotation for the nose and
mouth.
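Both ways of fixing the similarity transform reduce to a few lines if 2D points are treated as complex numbers, since a similarity transform is then $z \mapsto a z + b$ with $a = s e^{i\theta}$. The sketch below is illustrative; the function names are hypothetical and the rotation sign convention (counter-clockwise positive) is an assumption.

```python
import numpy as np

def similarity_from_two_points(q_std1, q_std2, q_a1, q_a2):
    """Equation (10): solve for s, theta, (tx, ty) from two anchor pairs."""
    zs1, zs2 = complex(*q_std1), complex(*q_std2)
    za1, za2 = complex(*q_a1), complex(*q_a2)
    a = (za2 - za1) / (zs2 - zs1)      # a = s * exp(i * theta)
    b = za1 - a * zs1                  # translation as a complex number
    return abs(a), np.angle(a), (b.real, b.imag)

def translation_from_one_point(s, theta, q_std1, q_a1):
    """Equation (11): s and theta are fixed, solve only for (tx, ty)."""
    a = s * np.exp(1j * theta)
    b = complex(*q_a1) - a * complex(*q_std1)
    return (b.real, b.imag)

def apply_similarity(s, theta, t, q):
    """Evaluate P(q) = s R(theta) q + t for a 2D point q."""
    c, sn = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -sn], [sn, c]])
    return s * (rot @ np.asarray(q, dtype=float)) + np.asarray(t, dtype=float)
```

For the eyes template the two anchors would be the iris locations; for the nose and mouth, `translation_from_one_point` reuses the scale and rotation recovered from the eyes.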


While the vectorizer assumes that a face finder has
provided an initial estimate for $P$, we would like the
vectorizer to be insensitive to a coarse or noisy estimate
and to improve the estimate of $P$ during vectorization.
The similarity transform $P$ can be updated during the
iteration when our estimates change for the positions of
the anchor points $q_{a,i}$. This can be determined after
the shape step computes a new estimate of the shape
$y^{std}_{a-std}$. We can tell that an anchor point estimate is off
when there is nonzero flow at the anchor point

$$\| y^{std}_{a-std}(q_{std,i}) \| > \text{threshold}.$$

The correspondences can be used to update the anchor
point estimate

$$q_{a,i} = P(q_{std,i} + y^{std}_{a-std}(q_{std,i})).$$

Next, $P$ can be updated from the new anchor point locations
using equation (10) or (11), and $i_a$ can be resampled
again using equation (9) to produce a new $i'_a$.

3.2.3  Entire  procedure 

The  basic  vectorization  procedure  is  now  summarized. 
Lines  2(a)  and  (b)  are  the  texture  step,  lines  2(c)  and  (d) 
are  the  shape  step,  and  line  2(e)  updates  the  similarity 
transform  P. 

procedure vectorize

1. initialization

   (a) Estimate $P$ using a face detector. For example,
       a correlational face finder using averaged
       face templates can be used to estimate the
       translational component of $P$.

   (b) Resample $i_a$ using the similarity transform $P$,
       producing $i'_a$ (equation (9)).

   (c) $y^{std}_{a-std} = 0$.

2. iteration: solve for $y^{std}_{a-std}$, $\beta_i$, and $P$ by iterating
   the following steps until the $\beta_i$ stop changing.

   (a) Geometrically normalize $i'_a$ using $y^{std}_{a-std}$,
       producing $t_a$

       $t_a(x) = i'_a(x + y^{std}_{a-std}(x))$.

   (b) Project $t_a$ onto example set $e_i$, computing the
       linear coefficients $\beta_i$

       $\beta_i = e_i \cdot (t_a - t_{mean}), \quad 1 \le i \le N$.

   (c) Compute reference image $\hat{t}_a$ for correspondence
       by reconstructing the geometrically
       normalized input

       $\hat{t}_a = t_{mean} + \sum_{i=1}^{N} \beta_i e_i$.

   (d) Compute the shape component using optical
       flow

       $y^{std}_{a-std}$ = optical-flow($i'_a$, $\hat{t}_a$).

   (e) If the anchor points are misaligned, as indicated
       by optical flow, then:

       i. Update $P$ with new anchor points.

       ii. Resample $i_a$ using the similarity transform
           $P$, producing $i'_a$ (eqn (9)).

       iii. $y^{std}_{a-std}$ = optical-flow($i'_a$, $\hat{t}_a$).
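The control flow of the procedure can be sketched as a short loop. This is a structural sketch only: the resampling, warping, optical flow, and anchor-update routines are passed in as callables because the paper's own implementations are not reproduced here, and every name, the stopping tolerance `tol`, and the iteration cap `max_iter` are assumptions of the example.

```python
import numpy as np

def vectorize(i_a, P, E, t_mean, resample, warp, optical_flow,
              update_P=None, tol=1e-6, max_iter=20):
    """Alternate texture and shape steps until the coefficients settle.

    E: (h*w) x N orthonormal eigenimages; t_mean: h x w mean image.
    resample(i_a, P) -> i'_a, warp(i'_a, y) -> t_a,
    optical_flow(i'_a, t_hat) -> h x w x 2 flow,
    update_P(P, y) -> new P, or None when the anchors are aligned.
    """
    i_res = resample(i_a, P)                          # step 1(b)
    y = np.zeros(t_mean.shape + (2,))                 # step 1(c)
    beta_prev = None
    for _ in range(max_iter):
        t_a = warp(i_res, y)                          # step 2(a)
        beta = E.T @ (t_a.ravel() - t_mean.ravel())   # step 2(b)
        t_hat = (t_mean.ravel() + E @ beta).reshape(t_mean.shape)  # step 2(c)
        y = optical_flow(i_res, t_hat)                # step 2(d)
        if update_P is not None:                      # step 2(e)
            P_new = update_P(P, y)
            if P_new is not None:
                P = P_new
                i_res = resample(i_a, P)
                y = optical_flow(i_res, t_hat)
        if beta_prev is not None and np.linalg.norm(beta - beta_prev) < tol:
            break                                     # the beta_i stopped changing
        beta_prev = beta
    return y, beta, P
```

Passing the image operations in as arguments keeps the sketch independent of any particular optical flow or warping code.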

Fig. 9 shows snapshot images of $i'_a$, $t_a$, and $\hat{t}_a$ during
each iteration of an example vectorization. The iteration
number is shown in the left column, and the starting input
is shown in the upper left. We deliberately provided
a poor initial alignment for the iteration to demonstrate
the procedure's ability to estimate the similarity transform
$P$. As the iteration proceeds, notice how (1) improvements
in $P$ lead to a better global alignment in $i'_a$,
(2) the geometrically normalized image $t_a$ improves, and
(3) the image $\hat{t}_a$ becomes a more faithful reproduction
of the input. The additional row for $i'_a$ is given because
when step 2(e) is executed in the last iteration, $i'_a$ is
updated.

3.3  Pose  dependence  from  the  example  set 

The  example  images  we  have  used  in  the  vectorizer  so 
far  have  been  from  a  frontal  pose.  What  about  other 
poses,  poses  involving  rotations  out  of  the  image  plane? 

Because  we  are  being  careful  about  geometry  and  cor¬ 
respondence,  the  example  views  used  to  construct  the 
vectorizer  must  be  taken  from  the  same  out-of-plane  im¬ 
age  rotation.  The  resulting  vectorizer  will  be  tuned  to 
that  pose,  and  performance  is  expected  to  drop  as  an 
input  view  deviates  from  that  pose.  The  only  thing  that 
makes  the  vectorizer  pose-dependent,  however,  is  the  set 
of  example  views  used  to  construct  face  space.  The  it¬ 
eration  step  is  general  and  should  work  for  a  variety  of 
poses.  Thus,  even  though  we  have  chosen  a  frontal  view 
as  an  example  case,  a  vectorizer  tuned  for  a  different 
pose  can  be  constructed  simply  by  using  example  views 
from  that  pose. 


Figure 9: Snapshot images of $i'_a$, $t_a$, and $\hat{t}_a$ during the
three iterations of an example vectorization. See text for
details.

In section 5.1 on applying the vectorizer to feature detection,
we demonstrate two vectorizers, one tuned for
a frontal pose, and one for an off-frontal pose. Later,
in section 6.3, we suggest a multiple-pose vectorizer that
connects different pose-specific vectorizers through interpolation.

4  Hierarchical  implementation 

For  optimization  purposes,  the  vectorization  procedure 
is  implemented  using  a  coarse-to-fine  strategy.  Given
an  input  image  to  vectorize,  first  the  Gaussian  pyramid 
(Burt  and  Adelson  [14])  is  computed  to  provide  a  mul¬ 
tiresolution  representation  over  4  scales,  the  original  im¬ 
age  plus  3  reductions  by  2.  A  face  finder  is  then  run 
over  the  coarsest  level  to  provide  an  initial  estimate  for 
the  similarity  transform  P.  Next,  the  vectorizer  is  run 
at  each  pyramid  level,  working  from  the  coarser  to  finer 
levels.  As  processing  moves  from  a  coarser  level  to  a 
finer  one,  the  coarse  shape  correspondences  are  used  to 
initialize  the  similarity  transform  P  for  the  vectorizer  at 
the  finer  level. 
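A four-level Gaussian pyramid of the kind described above can be sketched as follows. The paper does not state the smoothing kernel, so the 5-tap binomial kernel and the reflective border handling below are assumptions, as are the function names.

```python
import numpy as np

# Assumed 5-tap binomial smoothing kernel (sums to 1).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(image):
    """Separable 5-tap smoothing with reflected borders."""
    h, w = image.shape
    padded = np.pad(image, 2, mode="reflect")
    # Horizontal pass over the padded image, then vertical pass.
    rows = sum(k * padded[:, i:i + w] for i, k in enumerate(KERNEL))
    return sum(k * rows[i:i + h, :] for i, k in enumerate(KERNEL))

def gaussian_pyramid(image, levels=4):
    """Original image plus (levels - 1) successive reductions by 2."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(blur(pyramid[-1])[::2, ::2])
    return pyramid
```

The vectorizer would then be run on `pyramid[-1]` first and on `pyramid[0]` last, with each level initialized from the one before it.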

4.1  Face  finding  at  coarse  resolution

Figure 10: Face finding templates are grey level averages
using two populations, all examples (left) plus people
with beards (right).

For our test images, face detection is not a major problem
since the subjects are shot against a uniform background.
For the more general case of cluttered backgrounds,
see the face detection work of Reisfeld and
Yeshurun [32], Ben-Arie and Rao [6], Sung and Poggio
[35], Sinha [34], and Moghaddam and Pentland [25].
For our test images, we found that normalized correlation
using two face templates works well. The normalized
correlation metric is

$$r = \frac{\langle T I \rangle - \langle T \rangle \langle I \rangle}{\sigma(T)\, \sigma(I)},$$

where $T$ is the template, $I$ is the subportion of the image being
matched against, $\langle T I \rangle$ is normal correlation, $\langle \cdot \rangle$
is the mean operator, and $\sigma$ measures standard deviation.
The templates are formed by averaging face grey
levels over two populations, an average of all examples
plus an average over people with beards. Before averaging,
example face images are first warped to standard
shape. Our two face templates for a frontal pose are
shown in Fig. 10. To provide some invariance to scale,
regions with high correlation response to these templates
are examined with secondary correlations where the scale
parameter is both increased and decreased by 20%. The
location/scale of correlation matches above a certain
threshold are reported to the vectorizer.
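The normalized correlation metric above is short enough to state directly in code. A minimal sketch, assuming the template and the image patch have the same shape; the function name and the choice to return 0 for a constant patch are assumptions of the example.

```python
import numpy as np

def normalized_correlation(template, patch):
    """(<TI> - <T><I>) / (sigma(T) sigma(I)) for two equally sized arrays."""
    t = np.asarray(template, dtype=float).ravel()
    i = np.asarray(patch, dtype=float).ravel()
    numerator = np.mean(t * i) - t.mean() * i.mean()
    denominator = t.std() * i.std()
    if denominator == 0:
        return 0.0  # a constant patch carries no structure to correlate
    return numerator / denominator
```

The score is unchanged when the patch brightness is scaled or offset, which is what makes it suitable for matching averaged templates against new faces.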

4.2  Multiple  templates  at  high  resolution 

When processing the different pyramid levels, we use a
whole face template at the two coarser resolutions and
templates around the eyes, nose, and mouth for the two
finer resolutions. This template decomposition across
scales is similar to Burt's pattern tree approach [13] for
template matching on a pyramid representation. At a
coarse scale, faces are small, so full face templates are
needed to provide enough spatial support for texture
analysis. At a finer scale, however, individual features -
eyes, noses - cover enough area to provide spatial support
for analysis, giving us the option to perform separate
vectorizations. The advantage of decoupling the
analysis of the eyes, nose, and mouth is that it should
improve generalization to new faces not in the original
example set. For example, if the eyes of a new face use
one set of linear texture coefficients and the nose uses
another, separate vectorization for the eyes and nose
provides the extra flexibility we need. However, if new
inputs always come from people in the original example
set, then this extra flexibility is not required and keeping
to whole-face templates should be a helpful constraint.

When  vectorizing  separate  eyes,  nose,  and  mouth  tem¬ 
plates  at  the  finer  two  resolutions,  the  template  of  the 
eyes  has  a  special  status  for  determining  the  scale  and 
image-plane  rotation  of  the  face.  The  eyes  template  is 
vectorized  first,  using  2  iris  features  as  anchor  points  for 
the  similarity  transform  P.  Thus,  the  eyes  vectoriza¬ 
tion  estimates  a  normalizing  similarity  transform  for  the 
face.  The  scale  and  rotation  parameters  are  then  fixed 
for  the  nose  and  mouth  vectorizations.  Only  one  anchor 
point  is  used  for  the  nose  and  mouth,  allowing  only  the 
translation  in  P  to  change. 

4.3  Example  results 

For the example case in Fig. 11, correspondences from
the shape component are plotted over the four levels
of the Gaussian pyramid. These segment features are
generated by mapping the averaged line segments from
Fig. 2 to the input image. To get a sense of the final
shape/texture representation computed at the highest
resolution, Fig. 12 displays the final output for the
Fig. 11 example. For the eyes, nose and mouth templates,
we show the geometrically normalized templates
$t_a$ and the reconstruction of those templates $\hat{t}_a$
using the linear texture coefficients. No images of this
person were used among the examples used to create the
eigenspaces.

We  have  implemented  the  hierarchical  vectorizer  in  C 
on  an  SGI  Indy  R4600  based  machine.  Once  the  example 
images  are  loaded,  multilevel  processing  takes  just  a  few 
seconds  to  execute. 

Experimental  results  presented  in  the  next  section  on 
applications  will  provide  a  more  thorough  analysis  of  the 
vectorizer. 

5  Applications 

Once  the  vectorized  representation  has  been  computed, 
how  can  one  use  it?  The  linear  texture  coefficients  can  be 
used  as  a  low-dimensional  feature  vector  for  face  recog¬ 
nition,  which  is  the  familiar  eigenimage  approach  to  face 
recognition  [37]  [2]  [26].  Our  application  of  the  vectorizer, 
however,  has  focused  on  using  the  correspondences  in  the 
shape  component.  In  this  section  we  describe  experimen¬ 
tal  results  from  applying  these  correspondences  to  two 
problems,  locating  facial  features  and  the  registration  of 
two  arbitrary  faces. 

5.1  Feature  finding 

After vectorizing an input image $i_a$, pixelwise correspondence
in the shape component $y^{std}_{a-std}$ provides a dense
mapping from the standard shape to the image $i_a$. Even
though this dense mapping does more than locate just
a sparse set of features, we can sample the mapping to
locate a discrete set of feature points in $i_a$. To accomplish
this, first, during off-line example preparation, the
feature points of interest are located manually with respect
to the standard shape. Then after the run-time
vectorization of $i_a$, the feature points can be located in
$i_a$ by following the pixelwise correspondences and then
mapping under the similarity transform $P$. For a feature
point $q_{std}$ in standard shape, its corresponding location
in $i_a$ is

$$P(q_{std} + y^{std}_{a-std}(q_{std})).$$

For example, the line segment features of Fig. 2 can
be mapped to the input by mapping each endpoint, as
shown for the test images in Fig. 13.
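Sampling the dense mapping at a feature point is a direct evaluation of $P(q_{std} + y^{std}_{a-std}(q_{std}))$. The sketch below assumes the flow is stored as an $H \times W \times 2$ array of `(dx, dy)` vectors, that the feature point is read at the nearest pixel, and that $P$ is given by its parameters `s`, `theta`, and `t`; all names are hypothetical.

```python
import numpy as np

def locate_feature(q_std, flow, s, theta, t):
    """Return P(q_std + y(q_std)) for a feature point q_std = (x, y)."""
    x, y = int(round(q_std[0])), int(round(q_std[1]))
    # Follow the pixelwise correspondence at the feature point.
    q = np.asarray(q_std, dtype=float) + flow[y, x]
    # Then map under the similarity transform P.
    c, sn = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -sn], [sn, c]])
    return s * (rot @ q) + np.asarray(t, dtype=float)
```

Mapping a line segment feature means calling `locate_feature` once for each of its two endpoints.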

Figure 11: Evolution of the shape component during
coarse-to-fine processing. The shape component is displayed
through segment features which are generated by
mapping the averaged line segments from Fig. 2 to the
input image.

Figure 12: Final vectorization at the original image resolution.

Figure 13: Example features located by sampling the
dense set of shape correspondences $y^{std}_{a-std}$ found by the
vectorizer.

In order to evaluate these segment features located
by the vectorizer, two vectorizers, one tuned for a frontal
pose and one for a slightly rotated pose, were each tested
on separate groups of 62 images. The test set consists
of 62 people, 2 views per person - a frontal and slightly
rotated pose - yielding a combined test set of 124 images.
Example results from the rotated view vectorizer
are shown in Fig. 14. Because the same views were used
as example views to construct the vectorizers, a leave-6-out
cross validation procedure was used to generate
statistics. That is, the original group of 62 images from a
given pose was divided into 11 randomly chosen groups
(10 of 6 people, 1 of the remaining 2 people). Each group
of images is tested using a different vectorizer; the vectorizer
for group G is constructed from an example set
consisting of the original images minus the set G. This
allows us to separate the people used as examples from
those in the test set.
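The leave-6-out split described above can be sketched as follows. The random seed, the function names, and the use of integer indices to stand for people are assumptions made for the example.

```python
import numpy as np

def leave_k_out_groups(n_people, k, seed=0):
    """Randomly partition n_people indices into groups of k (last may be smaller)."""
    order = np.random.default_rng(seed).permutation(n_people)
    return [order[i:i + k] for i in range(0, n_people, k)]

def cross_validation_folds(n_people, k, seed=0):
    """Yield (example_indices, test_indices) pairs, one per group G."""
    everyone = np.arange(n_people)
    for group in leave_k_out_groups(n_people, k, seed):
        # The vectorizer for group G is built from everyone except G.
        yield np.setdiff1d(everyone, group), group
```

With 62 people and `k = 6` this produces 11 folds, ten with 6 test people and one with the remaining 2, and no person appears in both the example set and the test set of the same fold.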

Qualitatively, the results were very good, with only
one mouth feature being completely missed by the vectorizer
(it was placed between the mouth and nose). To
quantitatively evaluate the features, we compared the
computed segment locations against manually located
"ground truth" segments, the same segments used for
off-line geometrical normalization. To report statistics
by feature, the segments in Fig. 2 are grouped into 6
features: left eye ($c_3$, $c_4$, $c_5$, $c_6$), right eye ($c_9$, $c_{10}$, $c_{11}$,
$c_{12}$), left eyebrow ($c_1$, $c_2$), right eyebrow ($c_7$, $c_8$), nose
($n_1$, $n_2$, $n_3$), and mouth ($m_1$, $m_2$).

Figure 14: Example features located by the vectorizer.

Two different metrics were used to evaluate how close
a computed segment came to its corresponding ground
truth segment. Segments in the more richly textured areas
(e.g. eye segments) have local grey level structure at
both endpoints, so we expect both endpoints to be accurately
placed. Thus, the "point" metric measures the
two distances between corresponding segment endpoints.
On the other hand, some segments are more edge-like,
such as eyebrows and mouths. For the "edge" metric
we measure the angle between segments and the perpendicular
distance from the midpoint of the ground truth
segment to the computed segment.

Next, the distances between the manual and computed
segments were thresholded to evaluate the closeness
of fit. A feature will be considered properly detected
when all of its constituent segments are within threshold.
Using a distance threshold of 10% of the interocular
distance and an angle threshold of 20°, we compute detection
rates and average distances between manual and
computed segments (Table 1). The eyebrow and nose errors
are more of a misalignment of a couple of points rather
than a complete miss (the mouth error was a complete
miss).
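The two metrics and the thresholding step can be sketched as follows. Segments are pairs of 2D endpoints. Two details are assumptions of this sketch rather than statements from the paper: the angle is treated as undirected (0 to 90 degrees), and the perpendicular distance is measured to the infinite line through the computed segment. The function names are hypothetical.

```python
import numpy as np

def point_metric(seg, gt):
    """Distances between corresponding endpoints of two segments."""
    return (np.linalg.norm(seg[0] - gt[0]), np.linalg.norm(seg[1] - gt[1]))

def edge_metric(seg, gt):
    """Angle between the segments (degrees) and perpendicular distance from
    the ground-truth midpoint to the line through the computed segment."""
    d, g = seg[1] - seg[0], gt[1] - gt[0]
    cos_angle = abs(np.dot(d, g)) / (np.linalg.norm(d) * np.linalg.norm(g))
    angle = np.degrees(np.arccos(np.clip(cos_angle, 0.0, 1.0)))
    mid = 0.5 * (gt[0] + gt[1])
    cross = d[0] * (mid[1] - seg[0][1]) - d[1] * (mid[0] - seg[0][0])
    return angle, abs(cross) / np.linalg.norm(d)

def segment_within_threshold(seg, gt, interocular, use_edge,
                             dist_frac=0.10, max_angle=20.0):
    """Apply the 10% interocular distance and 20 degree thresholds."""
    limit = dist_frac * interocular
    if use_edge:
        angle, dist = edge_metric(seg, gt)
        return angle <= max_angle and dist <= limit
    return max(point_metric(seg, gt)) <= limit
```

A feature such as the left eye would count as detected only if `segment_within_threshold` holds for every one of its constituent segments.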

In  the  next  section  we  consider  another  application  of 
the  shape  component  computed  by  the  vectorizer. 


5.2  Registration  of  two  arbitrary  faces 


Suppose  that  we  have  only  one  view  of  an  individual’s 
face  and  that  we  would  like  to  synthesize  other  views, 
perhaps  rotated  views  or  views  with  different  expres¬ 
sions.  These  new  “virtual”  views  could  be  used,  for  ex¬ 
ample,  to  create  an  animation  of  the  individual’s  face 
from  just  one  view.  For  the  task  of  face  recognition,  vir¬ 
tual  views  could  be  used  as  multiple  example  views  in 
a  view-based  recognizer.  In  this  section,  we  discuss  how 
the  shape  component  from  the  vectorizer  can  be  used  to 
synthesize  virtual  views.  In  addition,  these  virtual  views 
are  then  evaluated  by  plugging  them  into  a  view-based, 
pose-invariant  face  recognizer. 

To synthesize virtual views, we need to have prior
knowledge of a facial transformation such as head rotation
or expression change. A standard approach used in
the computer graphics and computer vision communities
for representing this prior knowledge is to use a 3D model
of the face (Akimoto, Suenaga, and Wallace [3], Waters
and Terzopoulos [36][39], Aizawa, Harashima, and
Saito [1], Essa and Pentland [20]). After the single available
2D image is texture mapped onto a 3D polygonal
or multilayer mesh model of the face, rotated views
can be synthesized by rotating the 3D model and rendering.
In addition, facial expressions have been modeled
[36][39][20] by embedding muscle forces that deform
the 3D model in a way that mimics human facial muscles.
Mapping image data onto the 3D model is typically
solved by locating corresponding points on both
the 3D model and the image or by simultaneously acquiring
both the 3D depth and image data using the
Cyberware scanner.

Figure 15 (panels: prototype, novel person, virtual view):
In parallel deformation, (a) a 2D deformation
representing a transformation is measured by finding correspondence
among prototype images. In this example,
the transformation is rotation and optical flow was used
to find a dense set of correspondences. Next, in (b), the
flow is mapped onto the novel face, and (c) the novel
face is 2D warped to a "virtual" view. Figure from [11].

We have investigated an alternative approach that
uses example 2D views of prototype faces as a substitute
for 3D models (Poggio and Vetter [30], Poggio and
Brunelli [29], Beymer and Poggio [11]). In parallel
deformation, one of the example-based techniques discussed
in Beymer and Poggio [11], prior knowledge of a facial
transformation such as a rotation or change in expression
is extracted from views of a prototype face undergoing
the transformation. As shown in Fig. 15, first a 2D
deformation representing the transformation is measured


                                   average distances
                                   point metric    edge metric
feature          detection rate    endpt. dist.    angle        perpend. dist.
                                   (pixels)        (degrees)    (pixels)
------------------------------------------------------------------------------
left eye         100% (124/124)    1.24            -            -
right eye        100% (124/124)    1.23            -            -
left eyebrow      97% (121/124)    -               5.1°         1.06
right eyebrow     96% (119/124)    -               4.8°         1.06
nose              99% (123/124)    1.45            3.2°         0.66
mouth             99% (123/124)    -               2.2°         0.53

Table 1: Detection rates and average distances between computed and “ground truth” segments. Qualitatively, the
eyebrow and nose errors were misalignments, while the mouth error did involve a complete miss.


[Figure 16 panels: real views, virtual views]

Figure 16: Example pairs of real and virtual views.


by finding correspondence between the prototype face
images. We use the same gradient-based optical flow
algorithm [9] used in the vectorizer to find a dense set of
pixelwise correspondences. Next, the prototype flow is
mapped onto the “novel” face, the individual for which
we wish to generate virtual views. This step requires
“interperson” correspondence between the prototype and
novel faces. Finally, the prototype flow, now mapped
onto the novel face, can be used to 2D warp the novel
face to produce the virtual view.

The difficult part of parallel deformation is automatically
finding a set of feature correspondences between
the prototype and novel faces. We have used the vectorizer
to automatically locate the set of facial features
shown in Fig. 14 in both the prototype and novel faces.
From this sparse set of correspondences, the interpolation
technique from Beier and Neely [5] is used to generate
a dense, pixelwise mapping between the two faces.
We then used the dense set of correspondences to map
rotation deformations from a single prototype to a group
of 61 other faces for generating virtual views. Fig. 16
shows some example pairs of real and virtual views.
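
The three steps of parallel deformation can be sketched in a few
lines of NumPy. This is an illustrative sketch under simplifying
assumptions, not necessarily the implementation used in the
experiments: the function names are hypothetical, flows are stored
as (H, W, 2) arrays of (dx, dy) displacements, the prototype flow is
sampled with a nearest-neighbor lookup, and holes left by the
forward warp are simply left at zero. The dense interperson
correspondence is taken as given (in the text it comes from the
Beier and Neely interpolation).

```python
import numpy as np

def sample_nearest(field, xs, ys):
    """Sample an (H, W, ...) array at float coordinates, clamping to the border."""
    h, w = field.shape[:2]
    xi = np.clip(np.rint(xs).astype(int), 0, w - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, h - 1)
    return field[yi, xi]

def map_flow_to_novel(proto_flow, novel_to_proto):
    """Carry the prototype deformation over to the novel face's pixel grid.

    proto_flow[y, x]     = (dx, dy) moving prototype pixel (x, y) to its place
                           in the transformed (e.g. rotated) prototype view.
    novel_to_proto[y, x] = (dx, dy) from novel pixel (x, y) to the matching
                           prototype pixel (the interperson correspondence).
    """
    h, w = novel_to_proto.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    return sample_nearest(proto_flow,
                          xs + novel_to_proto[..., 0],
                          ys + novel_to_proto[..., 1])

def forward_warp(image, flow):
    """Push every pixel along its flow vector; unfilled pixels stay 0."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.rint(xs + flow[..., 0]).astype(int)
    yd = np.rint(ys + flow[..., 1]).astype(int)
    inside = (xd >= 0) & (xd < w) & (yd >= 0) & (yd < h)
    out = np.zeros_like(image)
    out[yd[inside], xd[inside]] = image[inside]
    return out

def parallel_deformation(novel, proto_flow, novel_to_proto):
    """Map the prototype flow onto the novel face, then warp the novel face."""
    return forward_warp(novel, map_flow_to_novel(proto_flow, novel_to_proto))
```

With a zero interperson correspondence and a uniform one-pixel
prototype flow, `parallel_deformation` just shifts the novel image by
one pixel. A practical version would also need hole filling and
sub-pixel sampling.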

To evaluate these virtual views, we used them as
example views in a view-based, pose-invariant face
recognizer (see [11] for details). The problem is this: given
one real view of each person, can we recognize the person
under a variety of poses? Virtual views were used to
generate a set of rotated example views to augment the
single real view. Using a simple view-based approach
that represents faces with templates of the eyes, nose,
and mouth, we obtained a recognition rate of
85% on a test set of 620 images (62 people, 10 views per
person). To put this number in context, consider the
recognition results from a “base” case of two views per
person (the single real view plus its mirror reflection) and
a “best” case of 15 real views per person. When tested
on the same test set, these gave recognition rates of
70% for the two-view case and 98% for the 15-view
case. Thus, adding virtual views to the recognizer raises
the recognition rate by 15 percentage points, and the
performance of virtual views is about midway between the
base and best case scenarios.
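
The "about midway" comparison can be made explicit using only the
three recognition rates reported above:

```latex
\begin{align*}
\text{gain over base case} &= 85\% - 70\% = 15 \text{ percentage points},\\
\text{midpoint of base and best} &= \tfrac{1}{2}\,(70\% + 98\%) = 84\%,\\
\text{fraction of the base-to-best gap closed} &= \frac{85 - 70}{98 - 70} = \frac{15}{28} \approx 0.54 .
\end{align*}
```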

6  Future  work 

In this section, first we discuss some shorter-term work
for the existing vectorizer. This is followed by longer-term
ideas for extending the vectorizer to use parameterized
shape models and to handle multiple poses.

6.1  Existing  vectorizer 

So far the vectorizer has been tested on face images shot
against a solid background. It would be nice to demonstrate
the vectorizer working in cluttered environments.
To accomplish this, both the face detection and the
vectorizer should be made more robust to the presence of false
positive matches. To improve face detection, we would
probably incorporate the learning approaches of Sung
and Poggio [35] or Moghaddam and Pentland [25]. Both
of these techniques model the space of grey level face
images using principal components analysis. To judge
the “faceness” of an image, they use a distance metric
that includes two terms, “distance from face space” (see
Turk and Pentland [37])

$$ \| t_a - \hat{t}_a \| $$

and the Mahalanobis distance

$$ \sum_{i=1}^{N} \frac{\beta_i^2}{\lambda_i}, $$

where the $\beta_i$ are the eigenspace projection coefficients
and $\lambda_i$ are the eigenvalues from principal component
analysis. This distance metric could be added to the
vectorizer as a threshold test after the iteration step has
converged.
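
A minimal sketch of such a threshold test is given below. It is an
illustration rather than the method of [35] or [25]: the function
names, the two thresholds, and the choice of a mean-subtracted PCA
computed by SVD are all assumptions made for the example. The input
`t` stands for a geometrically normalized texture vector.

```python
import numpy as np

def fit_pca(textures, n_components):
    """textures: (num_examples, num_pixels) array of shape-normalized faces."""
    mean = textures.mean(axis=0)
    centered = textures - mean
    # Rows of vt are the principal directions (eigenimages).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = (s ** 2) / (len(textures) - 1)
    return mean, vt[:n_components], eigvals[:n_components]

def faceness_terms(t, mean, eigvecs, eigvals):
    """Return (distance from face space, Mahalanobis distance) for texture t."""
    beta = eigvecs @ (t - mean)           # eigenspace projection coefficients
    t_hat = mean + eigvecs.T @ beta       # reconstruction from the eigenspace
    dffs = np.linalg.norm(t - t_hat)      # || t - t_hat ||
    mahal = np.sum(beta ** 2 / eigvals)   # sum_i beta_i^2 / lambda_i
    return dffs, mahal

def looks_like_face(t, model, max_dffs, max_mahal):
    """Hypothetical accept/reject test applied after the vectorizer converges."""
    dffs, mahal = faceness_terms(t, *model)
    return bool(dffs <= max_dffs and mahal <= max_mahal)
```

Here `model` is the tuple returned by `fit_pca`. The thresholds
`max_dffs` and `max_mahal` would have to be chosen empirically, for
example from the distribution of both terms over the training faces.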

Our current coarse-to-fine implementation does not
exploit potential constraints that could be passed from
the coarser to finer scales. The only information currently
passed from a coarse level to the next finer level
is the set of feature locations used to initialize the similarity
transform P. This could be expanded to help initialize
the shape and texture components at the finer level
as well.
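
As an illustration of the hand-off that does exist, the sketch below
fits a similarity transform to feature locations found at the coarse
level and rescaled to the next finer level. Several details are
assumptions made for the example and are not stated in the text:
the function names, a factor of 2 between pyramid levels, a
least-squares fit, and the convention that P maps standard-shape
coordinates to image coordinates.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity transform mapping src points onto dst points.

    Returns (A, t) with A = [[a, -b], [b, a]] (rotation times scale) such
    that dst is approximately src @ A.T + t.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    m = np.zeros((2 * n, 4))
    # x' = a*x - b*y + tx
    m[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    # y' = b*x + a*y + ty
    m[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    a, b, tx, ty = np.linalg.lstsq(m, dst.reshape(-1), rcond=None)[0]
    return np.array([[a, -b], [b, a]]), np.array([tx, ty])

def init_finer_level(std_features, coarse_features, scale=2.0):
    """Initialize P at the finer level from features located at the coarser one.

    std_features:    feature locations in the standard shape, in finer-level pixels.
    coarse_features: the same features as located in the coarser-level image.
    """
    finer_features = scale * np.asarray(coarse_features, dtype=float)
    return fit_similarity(std_features, finer_features)
```

Passing shape or texture estimates between levels, as suggested
above, would need an analogous upsampling step for the flow field
and for the texture vector.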

6.2  Parameterized  shape  model 

In the current vectorizer, shape is measured in a “data-driven”
manner using optical flow. However, we can explicitly
model shape by taking a linear combination of
example shapes

$$ y_a^{std} = \sum_{i=1}^{n} \alpha_i \, y_{p_i}^{std}, $$

where the shape of the ith example image, $y_{p_i}^{std}$, is the
2D warping used to geometrically normalize the image
in the off-line preparation step. This technique for modeling
shape is similar to the work of Cootes, et al. [17],
Blake and Isard [12], Baumberg and Hogg [4], and Jones
and Poggio [21]. The new shape step would, given $i_a$
and reference $\hat{t}_a$, try to find a set of coefficients $\alpha_i$ that
minimizes the squared error of the approximation

$$ i_a\Big(x + \sum_{i=1}^{n} \alpha_i \, y_{p_i}^{std}(x)\Big) = \hat{t}_a . $$

This involves replacing the optical flow calculation with a
model-based matching procedure; one can think of it as a
parameterized “optical flow” calculation that computes
a single set of linear coefficients instead of a flow vector
at each point. One advantage of modeling shape is the
extra constraint it provides, as some “illegal” warpings
cannot even be represented. Additionally, compared to
the raw flow, the linear shape coefficients should be more
amenable to shape analysis tasks like expression analysis
or face recognition using shape.
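
One possible form of such a model-based matching step is sketched
below. It is only an illustration: the Gauss-Newton iteration, the
bilinear sampling, the function names, and the (n, H, W, 2) layout
of the example shapes are our assumptions for this example, and the
text does not commit to a particular minimization procedure. The
code looks for coefficients that make the input image, sampled along
the combined warp, match the reference image in the least-squares
sense.

```python
import numpy as np

def bilinear(img, xs, ys):
    """Bilinearly sample a 2D array at float coordinates (clamped to the image)."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1.001)
    ys = np.clip(ys, 0, h - 1.001)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    fx, fy = xs - x0, ys - y0
    top = img[y0, x0] * (1 - fx) + img[y0, x0 + 1] * fx
    bot = img[y0 + 1, x0] * (1 - fx) + img[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

def fit_shape_coeffs(image, reference, example_flows, n_iters=20):
    """Fit alpha so that image(x + sum_i alpha_i * flow_i(x)) ~ reference(x).

    example_flows: (n, H, W, 2) array, one (dx, dy) field per example shape.
    """
    h, w = reference.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    gy, gx = np.gradient(image)            # image derivatives along y and x
    alpha = np.zeros(len(example_flows))
    for _ in range(n_iters):
        d = np.tensordot(alpha, example_flows, axes=1)   # combined (H, W, 2) warp
        px, py = xs + d[..., 0], ys + d[..., 1]
        resid = bilinear(image, px, py) - reference
        ix, iy = bilinear(gx, px, py), bilinear(gy, px, py)
        # Column k of the Jacobian: warped image gradient dotted with flow k.
        jac = np.stack([(ix * f[..., 0] + iy * f[..., 1]).ravel()
                        for f in example_flows], axis=1)
        alpha -= np.linalg.lstsq(jac, resid.ravel(), rcond=None)[0]
    return alpha
```

Because the unknowns are a handful of coefficients rather than a flow
vector per pixel, every pixel contributes to every coefficient. This
is the extra constraint mentioned above. Like gradient-based optical
flow, the linearization is only valid for small residual
displacements, so in practice it would be run coarse-to-fine.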

Given this new model for shape in the vectorizer, the
set of $\alpha$ shape coefficients and $\beta$ texture coefficients could
be used as a low-dimensional representation for faces.
An obvious application of this would be face recognition.
Even without the modified vectorizer and the $\alpha$
coefficients, the $\beta$ coefficients alone could be evaluated
as a representation for a face recognizer.

6.3  Multiple  poses 

The straightforward way to handle different out-of-plane
image rotations with the vectorizer is simply to use several
vectorizers, each tuned to a different pose. However,
if we provide pixelwise correspondence between the standard
shapes of the different vectorizers, their operations
can be linked together through image interpolation. The
main idea is to interpolate among the images of the
different vectorizers to produce a new image that reconstructs
both the grey levels and the pose of the input image
(see Beymer, Shashua and Poggio [10] for examples
of interpolation across different poses). Correspondence
is then found between the input and this new interpolated
image using optical flow. This correspondence, in
turn, gives us correspondence between the input and the
individual vectorizers, so the input can be warped to
each one for a combined textural analysis. This procedure
requires adding pose to the existing state variables
of shape, texture, and similarity transform P. The output
of this multi-pose vectorizer would be useful for pose
estimation and pose-invariant face recognition.

7  Conclusion 

In this paper, we first introduced a vectorized image
representation, a feature-based representation where
correspondence has been established with respect to a
reference image. Two image measurements are made at the
feature points. First, feature geometry, or shape, is
represented by the (x,y) feature locations relative to the
standard face shape. Second, grey levels, or texture, are
represented by mapping image grey levels onto the standard
face shape. Given this definition, the primary focus of
this paper was to explore an automatic technique for
computing this vectorized representation for face images.

To design an algorithm for vectorizing images, or a
“vectorizer”, we observed that the two representations
can be linked. That is, for textural analysis, the shape
component can be used to geometrically normalize an
image so that features are properly aligned. Conversely,
for shape analysis, the textural analysis can be used to
create a reference image that reconstructs a geometrically
normalized version of the input. We can then compute
shape by finding correspondence between the reference
image, which is at standard shape, and the input.
The main idea of our vectorizer is to exploit the natural
feedback between the texture and shape computations
by iterating back and forth between the two until
the shape/texture representation converges. We have
demonstrated an efficient implementation of the vectorizer
using a hierarchical coarse-to-fine strategy.

Two applications of the shape component were explored:
facial feature finding and the registration of two
faces. In our feature finding experiments, eyes, nose,
mouth, and eyebrow features were located in 124 test
images of 62 people at two different poses, and only one
mouth feature was missed by the system. In the second
application, one wants to generate new views of a
“novel” face given just one view. Prior knowledge of a
facial transformation such as a rotation is represented
by 2D example images of a “prototype” face undergoing
the transformation. The problem here is to register the
“novel” face with a prototype face. We showed how to
perform this registration step using features located by
the vectorized shape component.